ABSTRACT

The glycan fragment database (GFDB), freely 
available at http://www.glycanstructure.org, is a 
database of the glycosidic torsion angles derived 
from the glycan structures in the Protein Data 
Bank (PDB). Analogous to protein structure, the 
structure of an oligosaccharide chain in a glycopro- 
tein, referred to as a glycan, can be characterized by 
the torsion angles of glycosidic linkages between 
relatively rigid carbohydrate monomeric units. 
Knowledge of accessible conformations of biologic- 
ally relevant glycans is essential in understanding 
their biological roles. The GFDB provides an intuitive 
glycan sequence search tool that allows the user to 
search complex glycan structures. After a glycan 
search is complete, each glycosidic torsion angle 
distribution is displayed in terms of the exact 
match and the fragment match. The exact match 
results are from the PDB entries that contain the 
glycan sequence identical to the query sequence. 
The fragment match results are from the entries 
with the glycan sequence whose substructure 
(fragment) or entire sequence is matched to 
the query sequence, such that the fragment results 
implicitly include the influences from the nearby 
carbohydrate residues. In addition, clustering 
analysis based on the torsion angle distribution 
can be performed to obtain the representative struc- 
tures among the searched glycan structures. 

INTRODUCTION 

An oligosaccharide moiety in a glycoprotein, referred to as 
a glycan, comes in a diversity of sequences and structures, 
and specific interactions between carbohydrates and 
proteins are essential in many cellular events (1-3). 
These events require molecular recognition of specific 
carbohydrate structures that seems to be sensitive to 
small differences in carbohydrate structure. For instance,
the carbohydrate structures found on a host cell receptor,
which only differ by the sequence of the terminal sugar 
residues, are believed to be a major factor in determining 
the host range (e.g. swine, avian or human) of influenza 
viruses (4,5). In addition, glycosyl transferases and 
glycosidases recognize specific sequences and spatially 
arranged oligosaccharide chains (6,7). Thus, understand- 
ing the carbohydrate conformations will provide insight 
into the role of glycans in modulating many cellular 
events. 

Analogous to protein structure, the structure of 
an oligosaccharide chain can be characterized by the 
torsion angles of glycosidic linkages between relatively 
rigid carbohydrate monomeric units. Considerable
efforts have already been made to characterize the potential
energy surface of the peptide bond conformation, and
the accessible torsion angles of a peptide are well known 
(8-12). However, unlike proteins and peptides where the 
amino acid units are linearly linked together by the 
same peptide bonds, glycans can have branches, and 
each monosaccharide unit can be linked by different 
types of glycosidic linkages. In addition, the lack of 
experimentally derived atomic structures of oligosacchar- 
ides in aqueous solution makes it difficult to characterize 
the accessible torsion angles of a particular glycosidic 
linkage. 

Despite the difficulties involved in crystallization, 
the number of glycoprotein structures deposited in the 
Protein Data Bank (PDB) (13) has been steadily 
increasing (14,15). Although far from complete, glycan 
structures in the PDB can be used to study the accessible 
glycosidic torsion angles (16-19). Unfortunately, however, 
extracting structural information of glycans from the PDB 
is not trivial because of a lack of standardized nomencla- 
ture and the way the data are presented in the PDB (3,14). 
Recently, Säwén et al. (19) analysed the accessible glycosidic
torsion angles of the α(1→2)-linked mannose
disaccharide using the PDB glycan structures, but they
had to make considerable efforts to collect the entries and
filter out erroneous ones.

In this work, we present the glycan fragment database 
(GFDB), a database of the glycosidic torsion angles 
derived from the PDB glycan structures. Carbohydrate
structures in the PDB are recognized by Glycan Reader, 
an automatic sugar identification algorithm that we 
developed (15), instead of using the nomenclature pre- 
sented in the PDB entries. The GFDB provides an intui- 
tive glycan sequence search tool that allows the user to 
search complex glycan structures. After a glycan search 
is complete, each glycosidic torsion angle distribution of 
the searched glycan structures is displayed. In addition, 
the torsion angle distributions can be clustered to 
generate representative structures using the clustering 
analysis facility on the GFDB interface. To facilitate the 
conformational analysis of glycosidic linkages, the GFDB 
also provides various filters. In the following sections, we 
discuss how the glycan structural information was col- 
lected, how to search a glycan sequence and how 
the search results are displayed. A step-by-step guide to
the GFDB is also provided at
http://www.glycanstructure.org/fragment-db.



GLYCAN FRAGMENT DATABASE 

To recognize the PDB entries that contain carbohydrate 
molecules, we used Glycan Reader for automatic sugar 
identification (15). Briefly, in Glycan Reader, topologies 
of the molecules in the HETATM section of a PDB file 
are first generated using the atom connection information 
from the CONECT section. The carbohydrate candidate 
molecules (six-membered rings for pyranoses and
five-membered rings for furanoses, each composed of
exactly one oxygen atom and carbon atoms) are then identified.
For each carbohydrate-like molecule, the chemical groups 
attached to each position of the ring and their orientations 
are compared with a pre-defined table to identify the 
correct chemical name for the carbohydrates. Glycan 
chains are constructed by examining the glycosidic 
linkages between the carbohydrate molecules that have 
chemical bonds between them. 
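
The ring-identification step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Glycan Reader implementation: the function names `parse_hetatm_graph` and `find_sugar_rings` are hypothetical, the parser reads only the fixed-column serial and element fields of HETATM records and the serial fields of CONECT records, and a simple path search stands in for whatever ring perception Glycan Reader actually uses.

```python
from collections import defaultdict


def parse_hetatm_graph(pdb_text):
    """Collect HETATM elements and CONECT bonds from fixed-column PDB text."""
    elements = {}              # atom serial -> element symbol
    bonds = defaultdict(set)   # atom serial -> serials of bonded atoms
    for line in pdb_text.splitlines():
        if line.startswith("HETATM"):
            # Serial number in columns 7-11, element symbol in columns 77-78.
            elements[int(line[6:11])] = line[76:78].strip().upper()
        elif line.startswith("CONECT"):
            # Five-character serial fields starting at column 7.
            fields = (line[i:i + 5] for i in range(6, 31, 5))
            serials = [int(f) for f in fields if f.strip()]
            for other in serials[1:]:
                bonds[serials[0]].add(other)
                bonds[other].add(serials[0])
    return elements, bonds


def find_sugar_rings(elements, bonds, sizes=(5, 6)):
    """Return (kind, atom serials) for rings with one oxygen and carbons only."""
    rings = set()

    def extend(path):
        for nxt in bonds.get(path[-1], ()):
            if nxt not in elements:
                continue                       # bond to a non-HETATM atom
            if nxt == path[0] and len(path) in sizes:
                rings.add(frozenset(path))     # closed a 5- or 6-membered cycle
            elif nxt > path[0] and nxt not in path and len(path) < max(sizes):
                extend(path + [nxt])

    # Each cycle is searched from its lowest serial, so it is found once per
    # direction and stored once as a set of atoms.
    for start in sorted(elements):
        extend([start])

    found = []
    for ring in rings:
        symbols = [elements[a] for a in ring]
        if symbols.count("O") == 1 and symbols.count("C") == len(ring) - 1:
            kind = "pyranose" if len(ring) == 6 else "furanose"
            found.append((kind, sorted(ring)))
    return sorted(found, key=lambda item: item[1])
```

The sketch stops at candidate rings; naming the sugar would still require comparing the substituents and their orientations at each ring position with a reference table, as the text describes.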

Identified carbohydrate molecules are further analysed 
and recorded in the GFDB. First, the residue name 
annotated in the PDB is compared with the molecular 
structure. The disparity of the residue name annotation 
in the PDB and the actual molecular structure is 
common (14). Although Glycan Reader returns the 
correct carbohydrate names according to the molecular 
structures, such disparity could be a sign of potential 
error. Second, because a distorted ring geometry could
mislead the interpretation of the glycosidic torsion
angles, the geometry of the carbohydrate ring is calculated
using the virtual torsion angle definition (20), and whether
the ring is in a chair conformation (¹C₄ or ⁴C₁) is recorded.
Finally, if the carbohydrate molecules have chemical
groups (phosphate, sulphate, methyl and so forth)
attached to one of the hydroxyl groups, the carbohydrates
are marked as derived carbohydrates in the GFDB. The
entries that belong to these cases can be excluded from the 
search using the filtering options, such as 'misassigned 
residues', 'distorted ring geometry' and 'derived carbohy- 
drates' (see later in the text and also Figure 1). 



WEB INTERFACE 

The GFDB provides a glycan sequence search interface 
that allows the user to search complex glycan sequences 
(Figure 1). The search interface provides a visual guide 
as the user builds a complex glycan query sequence, and 
the interface is compatible with any modern web 
browser with JavaScript capability. There is a report 
generation facility available to generate an archived 
report file that contains all the raw data for a given 
search and 3D structures based on the clustering 
analysis (see later in the text); the user can also receive
the archived report file by email. Several filtering
functions are available (Figure 1) to narrow the
search results for specific needs, such as filters for only
N-/O-glycosylated glycans, the resolution of the PDB
entries and/or the aforementioned three structural
features (misassigned residues, distorted ring geometry
and derived carbohydrates).
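
The effect of these filters can be sketched as a predicate applied to each stored glycan record. The record layout below is an assumption made for illustration: `GlycanRecord`, its field names and `apply_filters` are hypothetical and do not describe the GFDB schema or its server code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlycanRecord:
    # All field names are hypothetical, chosen for this sketch only.
    pdb_id: str
    linkage_type: str            # "N-linked", "O-linked" or "ligand"
    resolution: Optional[float]  # in angstroms; None if not reported
    misassigned: bool            # PDB residue name disagrees with the structure
    distorted_ring: bool         # ring is not in a chair conformation
    derived: bool                # extra chemical group on a hydroxyl


def apply_filters(records, types=None, max_resolution=None,
                  exclude_misassigned=False, exclude_distorted=False,
                  exclude_derived=False):
    """Return the records that pass every filter that is switched on."""
    kept = []
    for rec in records:
        if types is not None and rec.linkage_type not in types:
            continue
        if max_resolution is not None and (
                rec.resolution is None or rec.resolution > max_resolution):
            continue
        if exclude_misassigned and rec.misassigned:
            continue
        if exclude_distorted and rec.distorted_ring:
            continue
        if exclude_derived and rec.derived:
            continue
        kept.append(rec)
    return kept
```

For example, `apply_filters(records, types={"N-linked"}, max_resolution=2.5)` would keep only N-linked glycans from entries with a reported resolution of 2.5 Å or better.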

When analysing the glycosidic torsion angles in the 
PDB, it is important to understand that there are redun- 
dant PDB entries from the same or similar proteins. 
Without removing those redundant entries, it is possible 
to overestimate the preference of a certain conformation 
for a given glycan sequence. Although redundancies in 
the PDB can be removed by post-processing the data 
obtained by the GFDB, the GFDB provides a prelimin- 
ary filter option for removing such redundant protein 
entries for N-linked or O-linked glycan chains based on 
the sequence similarity of the parent protein. 



[Figure 1 screenshot. Panels: 'Search Sequence' (query builder
showing D-NA-glucose, 4-linked D-NA-glucose, 4-linked D-mannose,
and 3- and 6-linked D-mannose branches), 'E-mail (optional)',
'Generate report', 'Filter' (By Type: N-linked, O-linked, ligand;
By PDB Info: resolution, method, year; Exclude entries with:
misassigned residues, distorted ring geometry, derived
carbohydrates, sequence similarity) and 'Sequence Graph'.]



Figure 1. The GFDB search interface. 
[Figure 2 screenshot. 'Search Result': 134 glycans with the exact
sequence and 440 glycans with the sequence fragment. Torsion angle
distribution plots (axis label: phi) for 'Exact Match' and
'Fragment', each with a 'Download Raw Data' link. 'Clustering
Result': representative structures for the exact match and for the
fragments, listed by cluster with population percentages and
'Download PDB' / 'Download CHARMM Input' links.]

Figure 2. An example of the search result for the query sequence in
Figure 1. The glycosidic torsion angle distribution of a particular
glycosidic linkage can be displayed by clicking the glycosidic linkage in
'Sequence Graph' in Figure 1. The clustering analysis of the glycan
chain can be performed, and the top-five representative structures can
be downloaded. The glycosidic torsion angle distribution of a selected
cluster is shown in red.



After a glycan search is finished, the interface shows 
two torsion angle distributions side by side (Figure 2), 
'exact match' and 'fragment match' (Figure 3). For the 
exact match, the GFDB first performs a sequence search 
to find the PDB entries that contain the glycan sequence 
identical to the query sequence, and the resulting torsion 
angle values for each glycosidic linkage are displayed to the 
user. In contrast, the fragment search is performed
against the substructures of each glycan (hence the name
fragments, Figure 3) and returns the entries having at least
one substructure that matches the query sequence.
This provides more samples for the torsion angle
analysis. The torsion angle values from the fragment
match always include the exact match results. However,
the two distributions may differ because part of a glycan
structure can adopt a different conformation when it has
extra intra- and intermolecular interactions. Therefore, the
fragment match results implicitly include the influences from
the nearby carbohydrate residues and different
protein-carbohydrate interactions, such that one can assess
the flexibility of a certain glycosidic linkage in the context
of a larger glycan chain by comparing the exact and fragment
match results.
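
The difference between the two match types can be sketched on a tree representation of a glycan. Everything named here is an assumption for illustration: the `Residue` class, the linkage labels such as "a1-3" and the rule that a fragment match may carry extra branches or extra residues around the matched substructure are not taken from the GFDB implementation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Residue:
    """One sugar; children are keyed by a linkage label such as 'a1-3'."""
    name: str
    children: Dict[str, "Residue"] = field(default_factory=dict)


def embeds_at(query, target):
    """True if the query tree matches here; the target may have extra branches."""
    if query.name != target.name:
        return False
    for link, q_child in query.children.items():
        t_child = target.children.get(link)
        if t_child is None or not embeds_at(q_child, t_child):
            return False
    return True


def is_exact_match(query, target):
    """Identical sequence: same residues and linkages, nothing extra."""
    return query == target   # dataclass equality compares the trees recursively


def is_fragment_match(query, target):
    """The query matches the whole target or any substructure of it."""
    if embeds_at(query, target):
        return True
    return any(is_fragment_match(query, child)
               for child in target.children.values())
```

Under these assumptions every exact match is also a fragment match, which mirrors the statement that the fragment results always include the exact match results.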

Figure 3. An example of the exact and fragment matches based on the
query sequence in Figure 1. (A) The glycan sequence for the exact
match results. (B and C) Examples of the glycan sequences for the
fragment match results. The matched substructure is highlighted in
the red rectangles. The sequence in (A) is also included in the
fragment match results.



The glycosidic torsion angle definition in the GFDB is
adopted from the crystallographic definition: O5-C1-O1-C'x (φ),
C1-O1-C'x-C'(x-1) (ψ) and O1-C'6-C'5-O'5 (ω). The
torsion angle between the first residue of the N-glycan
chain and the side chain of the asparagine residue is
defined as O5-C1-N'D2-C'G (φ) and C1-N'D2-C'G-C'B (ψ).
The torsion angle between the first residue of the O-glycan
chain and the side chain of the serine residue is defined as
O5-C1-O'G-C'B (φ) and C1-O'G-C'B-C'A (ψ). For threonine,
OG1 is used instead of OG. The atom names are
based on the CHARMM topology.
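
Given atomic coordinates, each of these torsion angles is an ordinary four-atom dihedral. The sketch below shows one standard way to compute it; the function names and the coordinate-dictionary keys (for example "C4'" for the linked carbon of the second residue) are assumptions for illustration and not part of the GFDB.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle p0-p1-p2-p3 in degrees (IUPAC sign convention)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def linkage_torsions(coords, x):
    """phi (O5-C1-O1-C'x) and psi (C1-O1-C'x-C'(x-1)) for a 1->x linkage.

    coords maps hypothetical atom labels to (x, y, z) tuples; primed labels
    such as "C4'" belong to the residue carrying the linked hydroxyl.
    """
    cx, cx_prev = coords[f"C{x}'"], coords[f"C{x - 1}'"]
    phi = dihedral(coords["O5"], coords["C1"], coords["O1"], cx)
    psi = dihedral(coords["C1"], coords["O1"], cx, cx_prev)
    return phi, psi
```

For 1-6 linkages the third angle ω (O1-C'6-C'5-O'5) can be obtained with the same `dihedral` function.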



CLUSTERING ANALYSIS 

Statistical analysis of the torsion angle values of a particular
glycosidic linkage is useful for estimating the allowable
conformations of glycan chains, but it does not by itself
reveal the representative (or most probable)
structures of the given glycan sequence among the available
PDB glycan structures. To provide useful insight into
the 3D glycan structure, the GFDB provides an option to
perform clustering analysis of the torsion angle search
results and produce glycan-only structures for the five
most populated clusters.

The GFDB uses a simple clustering method to efficiently 
determine the members of each cluster. The pairwise 
torsion angle differences are first calculated by the follow- 
ing equation: 



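[Equation (1): pairwise torsion angle difference between glycan
structures i and j, summed over the glycosidic linkages k = 1, ..., N]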



where N is the total number of glycosidic linkages in a
glycan sequence, φₖ and ψₖ are the torsion angle values of
the k-th glycosidic linkage, and i and j represent two glycan
structures. ω torsion angle values are included only for
glycosidic linkages that have three rotatable bonds, such as
1-6 linkages. After the pairwise distance matrix of the
searched glycan structures is calculated, the first cluster is
identified as the one with the maximum number of neighbours
within a 30° cut-off radius; the cut-off value was empirically
determined. The second cluster is identified in the same 
manner after excluding the members that belong to the 
first cluster. The result of the top-five clusters and the cor- 
responding 3D glycan structure based on the centroid of 
each cluster is provided to the user along with the input 
files to generate the centroid glycan structures using the 
CHARMM biomolecular simulation program (21). 
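
The neighbour-counting procedure can be sketched as a greedy loop. This is an illustration under stated assumptions rather than the GFDB code: the distance used here (square root of the summed squared periodic differences over all torsion angles) is an assumed stand-in for the pairwise torsion angle difference, the function names are hypothetical, and the structure that seeds each cluster is simply reported as its centre.

```python
import math


def angle_diff(a, b):
    """Periodic difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def torsion_distance(s1, s2):
    """ASSUMED metric: root of summed squared periodic torsion differences."""
    return math.sqrt(sum(angle_diff(a, b) ** 2 for a, b in zip(s1, s2)))


def greedy_clusters(structures, cutoff=30.0, max_clusters=5):
    """Return [(centre index, member indices)] for up to max_clusters clusters.

    Each structure is a flat tuple of torsion angles in degrees, for example
    (phi_1, psi_1, phi_2, psi_2, ...), with the same layout for all structures.
    """
    n = len(structures)
    dist = [[torsion_distance(structures[i], structures[j]) for j in range(n)]
            for i in range(n)]
    remaining = set(range(n))
    clusters = []
    while remaining and len(clusters) < max_clusters:
        # Pick the unassigned structure with the most unassigned neighbours
        # within the cut-off (ties go to the lowest index).
        centre = max(sorted(remaining),
                     key=lambda i: sum(1 for j in remaining
                                       if dist[i][j] <= cutoff))
        members = sorted(j for j in remaining if dist[centre][j] <= cutoff)
        clusters.append((centre, members))
        remaining -= set(members)   # later clusters exclude these members
    return clusters
```

Wrapping the angle differences matters because torsion angles are periodic: values of -170° and 175° are 15° apart, not 345°.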



CONCLUDING DISCUSSION 

There are several databases that provide information on 
glycan structures or sequences derived from the PDB 
(or from other experiments). Many of these databases, 
such as BCSDB (22), KEGG GLYCAN (23) and 
Glycoconjugate Data Bank (24), store only glycan 
sequence information, whereas the GFDB focuses on the 
3D glycan structure. GlycoMaps DB (25) and GlyTorsion 
(26) provide torsion angle distributions of glycosidic 
linkages derived from computational calculations 
and from the PDB, respectively. Thus, the GlyTorsion 
database is the only database that can be directly 
compared with the GFDB. While the search interface of 
the GlyTorsion database is restricted to only one glycosidic
linkage, the GFDB can search more complex glycan
sequences with various filter functions and provides the
clustering analysis and the top-five clustered structures. These
unique features in the GFDB allow researchers to collect 
complex glycan structural information easily and reliably. 

As of August 2012, the GFDB contains 5360 PDB
entries that contain at least one carbohydrate molecule
and 20 467 glycan chains. Among those glycan chains,
11 735 (57%) are N-linked glycan chains, 788 (4%)
are O-linked glycans and the remaining 7944 (39%)
exist as ligands. For the glycan structures with more than
two carbohydrates, the hierarchical fragmentation identi- 
fied a total of 81 370 fragment structures with 4267 unique 
glycan sequences; a unique glycan sequence has more than 
two carbohydrates and is defined by the carbohydrate 
sequence and the glycosidic linkages. There are 30 375 
glycosidic torsion angle values available in the GFDB. By 
providing the straightforward search tool, the filtering 
functions and the clustering analysis for the representative 
structures, we hope that the GFDB can help conform- 
ational analysis of various oligosaccharide chains and 
glycosidic linkages. The database will be updated quarterly 
and is freely available at http://www.glycanstructure.org. 